Note: By default most code chunks are hidden. “Code” buttons can be used to unhide the code.
This note investigates if and how systematic subsampling can be applied to imbalanced learning. The structure is as follows: to set the stage for the remainder of the analysis, the first section briefly introduces the bias-variance trade-off. The following section then introduces various subsampling methods. Section 3 illustrates the improvements associated with non-uniform subsampling. Section 4 applies the developed ideas to binary classification problems with imbalanced training data, first with synthetic and then with real data. The final section concludes.
source("R/gauss_kernel.R")
source("R/sinusoidal.R")
source("R/sim_sinusoidal.R")
source("R/regularized_ls.R")
# Generate data: ----
n <- 25
p <- 24
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
v <- 0.3
To set the stage for the remainder of this note we will briefly revisit the bias-variance trade-off in this section. In particular we will illustrate the effect of varying the sample size \(n\). Readers familiar with this topic may choose to skip this section.
As in Bishop (2006) we consider synthetic data generated by the sinusoidal function \(f(x)=\sin(2\pi x)\). To simulate random samples of \(\mathbf{y}\) we sample \(n\) input values from \(\mathbf{X} \sim \text{unif}(0,1)\) and add a random noise component \(\varepsilon \sim \mathcal{N}(0,0.3)\). Figure 1.1 shows the true function along with random draws \(\mathbf{y}^*_n\).
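The helper functions sourced above are not displayed in this note. For readers who want to follow along, a minimal sketch of what they presumably look like is shown below for reference only (it is not evaluated as part of the pipeline); the exact signatures, the return structure and whether sigma is a standard deviation or a variance are assumptions on my part.
# Hedged sketch of the sourced helpers (assumed implementations, not evaluated):
sinusoidal <- function(x) {
  sin(2 * pi * x) # true data-generating function f(x)
}
sim_sinusoidal <- function(n, sigma = 0.3, a = 0, b = 1) {
  x_star <- runif(n, min = a, max = b) # draw inputs uniformly on [a, b]
  y_star <- sinusoidal(x_star) + rnorm(n, mean = 0, sd = sigma) # add Gaussian noise
  # note: the actual helper is also accessed via $y later in the note,
  # so its true return structure likely contains additional elements
  return(list(x_star = x_star, y_star = y_star))
}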
# True data: ----
library(ggplot2)
library(data.table)
dt_true <- data.table(y,x)
pl <- ggplot(data=dt_true, aes(x=x, y=y)) +
geom_line()
# Simulate data: ----
n_draws <- 3
dt_star <- rbindlist(
lapply(
1:n_draws,
function(x) {
simulated <- sim_sinusoidal(n=n, sigma = v)
data.table(y=simulated$y_star, x=simulated$x_star, n=1:n, draw=x)
}
)
)
pl +
geom_point(
data = dt_star,
aes(x=x, y=y, colour=factor(draw))
) +
scale_colour_discrete(
name="Draw:"
)
Figure 1.1: Sinusoidal function and random draws.
lambda <- c(
exp(2.6),
exp(-0.31),
exp(-2.4)
)
s <- 0.1
n_draws <- 100
mu <- seq(a,b,length.out = p)
Following Bishop (2006) we will use a Gaussian linear model with Gaussian kernel basis functions \(\phi_j(x)=\exp(-\frac{(x-\mu_j)^{2}}{2s^2})\), that is
\[ \begin{equation} \begin{aligned} && \mathbf{y}|\mathbf{X}& =f(x) \sim \mathcal{N} \left( \sum_{j=0}^{p-1} \phi_j(x)\beta_j, v \mathbb{I}_p \right) \\ \end{aligned} \tag{1.1} \end{equation} \]
with \(v=0.3\) to estimate \(\hat{\mathbf{y}}_k\) from random draws \(\mathbf{X}_k\). We fix the number of kernels \(p=24\) (and hence the number of features \(M=p+1=25\)) as well as the spatial scale \(s=0.1\). To vary the complexity of the model we use a form of regularized least-squares (Ridge regression) and let the regularization parameter \(\lambda\) vary
\[ \begin{equation} \begin{aligned} && \hat\beta&=(\lambda I + \Phi^T \Phi)^{-1}\Phi^Ty \\ \end{aligned} \tag{1.2} \end{equation} \]
where high values of \(\lambda\) in (1.2) shrink parameter values towards zero. (Note that \(\lambda=0\) corresponds to the OLS estimator, which is only defined as long as the number of features does not exceed \(n\).)
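The sourced functions gauss_kernel and regularized_ls are not displayed either. Minimal sketches consistent with the kernel definition above and with Equation (1.2) might look as follows; these are assumptions, shown for reference only, and not the note's actual implementations.
# Hedged sketches of the sourced helpers (assumed implementations, not evaluated):
gauss_kernel <- function(x, mu, s) {
  exp(-(x - mu)^2 / (2 * s^2)) # Gaussian basis function centred at mu
}
regularized_ls <- function(Phi, y, lambda) {
  # Ridge estimator (1.2): (lambda*I + Phi'Phi)^(-1) Phi'y
  solve(lambda * diag(ncol(Phi)) + crossprod(Phi), crossprod(Phi, y))
}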
As in Bishop (2006), to illustrate the bias-variance trade-off we proceed as follows: for each choice of \(\lambda\) and each sample draw we fit the regularized model (1.2) on the draw and predict \(\hat{\mathbf{y}}\) on the fixed input grid.
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
# Draw:
simulated <- sim_sinusoidal(n=n, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
rbindlist(
lapply( # loop over regularization parameter
1:length(lambda),
function(t) {
# Regularization parameter:
lambda_t <- lambda[t]
# Extract features:
Phi_k <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda_t) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,lambda=lambda_t,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(lambda,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(lambda=lambda)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
Applying the above procedure we can construct the familiar picture that demonstrates how increased model complexity increases variance while reducing bias (Figure 1.2). Recall that for the mean-squared error (MSE) we have
\[ \begin{equation} \begin{aligned} && \mathbb{E} \left( (\hat{f}_n(x)-f(x))^2 \right) &= \text{var} (\hat{f}_n(x)) + \left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \\ \end{aligned} \tag{1.3} \end{equation} \]
where the first term on the right-hand side corresponds to the variance of our prediction and the second term to its (squared) bias. In Figure 1.2, as model complexity increases the variance component of the MSE increases while the bias term diminishes. A similar pattern would have been observed if, instead of using regularization, we had used OLS and let the number of Gaussian kernels (and hence the number of features \(p\)) vary, where higher values of \(p\) correspond to increased model complexity.
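To make the decomposition in (1.3) concrete, the snippet below sketches how the variance and squared-bias components could be estimated empirically from the simulated draws in dt. It is illustrative only and not part of the pipeline that produces Figure 1.2; the objects truth and per_x are my own.
# Illustrative sketch: empirical decomposition of the MSE in (1.3)
truth <- data.table(x = x, truth = y) # true f(x) on the common grid
per_x <- merge(dt, truth, by = "x")[, .(
  var_hat = var(value), # variance across draws at each x
  sq_bias = (mean(value) - truth[1])^2 # squared bias at each x
), by = .(lambda, x)]
per_x[, .(variance = mean(var_hat), sq_bias = mean(sq_bias)), by = lambda]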
dt_plot[,log_lambda := log(lambda)]
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(log_lambda),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
Figure 1.2: Bias-variance trade-off
n <- 100
m <- c(0.10, 0.5, 0.90)
x <- seq(a,b,length.out = n)
y <- sinusoidal(x)
lambda <- exp(-0.31)
Phi <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
dt <- rbindlist(
lapply( # loop - draw K times
1:n_draws,
function(k) {
rbindlist(
lapply( # loop over sample sizes:
1:length(m),
function(t) {
n_sample <- m[t] * n
# Draw:
simulated <- sim_sinusoidal(n=n_sample, sigma = v)
y_k <- simulated$y_star
x_k <- simulated$x_star
# Extract features:
Phi_k <- cbind(
1,
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x_k, mu=mu_p, s = s)
}
)
)
beta_hat <- regularized_ls(Phi_k,y_k,lambda) # fit model on (y,x)
y_hat <- c(Phi %*% beta_hat) # predict from model
dt <- data.table(value=y_hat,draw=k,sample_size=n_sample,n=1:n,x=x)
return(dt)
}
)
)
}
)
)
dt[,facet_group:="single"]
dt[,colour_group:="estimates"]
# Expected values:
dt_exp = dt[,.(value=mean(value)),by=.(sample_size,n,x)]
dt_exp[,facet_group:="aggregate"]
dt_exp[,colour_group:="estimates"]
dt_exp[,draw:=1] # just for aesthetics
# True values:
library(reshape)
dt_true = data.table(expand.grid.df(data.frame(value=y,x=x),data.frame(sample_size=m*n)))
dt_true[,facet_group:="aggregate"]
dt_true[,colour_group:="true"]
dt_true[,draw:=2] # just for aesthetics
# Plot data:
dt_plot = rbind(
dt,
dt_exp,
dt_true,
fill=T
)
The focus of this note is instead on varying the sample size \(n\). It should not be surprising that both the variance and bias component of the MSE decrease as the sample size \(n\) increases (Figure 1.3). But in today’s world \(n\) can potentially be very large, so much so that even computing simple linear models can be hard. Suppose for example you wanted to use patient data that is generated in real-time as a global pandemic unfolds to predict the trajectory of said pandemic. Or consider the vast quantities of potentially useful user-generated data that online service providers have access to. In the remainder of this note we will investigate how systematic subsampling can help improve model accuracy in these situations.
ggplot(data=dt_plot[draw<=25], aes(y=value, x=x, colour=colour_group, group=draw)) +
geom_line() +
facet_grid(
rows = vars(sample_size),
cols = vars(facet_group)
) +
scale_color_discrete(
name="Type:"
) +
labs(
y = "f(x)"
)
Figure 1.3: Effect of sample size on bias and variance.
source("R/UNIF.R")
source("R/BLEV.R")
source("R/OPT.R")
source("R/PL.R")
source("R/wls_qr.R")
source("R/rorthogonal.R")
source("R/sim_subsampling.R")
n <- 1000
m <- 10
The case for subsampling generally involves \(n >> p\), so very large values of \(n\). In such cases we may be interested in estimating \(\hat\beta_m\) instead of \(\hat\beta_n\), where \(p\le m<<n\) with \(m\) freely chosen by us. In practice we may want to do this to avoid the high computational costs associated with large \(n\) as discussed above. The basic algorithm for estimating \(\hat\beta_m\) is simple: draw a subsample \(\mathbf{X}_m\) of \(m\) observations from the full data according to some set of subsampling probabilities and estimate \(\hat\beta_m\) on that subsample only.
But there are at least two questions about this algorithm: firstly, how do we choose \(\mathbf{X}_m=({\mathbf{X}^{(1)}}^T,...,{\mathbf{X}^{(m)}}^T)^T\)? Secondly, how should we construct \(\hat\beta_m\)? With respect to the former, a better idea than just randomly selecting \(\mathbf{X}_m\) might be to choose observations with high influence. We will look at a few of the subsampling methods investigated and proposed in Zhu et al. (2015), which differ primarily in their choice of subsampling probabilities \(\{\pi_i\}^n_{i=1}\): uniform subsampling (UNIF) with \(\pi_i=1/n\), basic leveraging (BLEV) with probabilities proportional to the leverage scores \(h_{ii}\), optimal subsampling (OPT) and predictor-length subsampling (PL) with probabilities proportional to the Euclidean norms \(||\mathbf{X}_i||\).
The latter two methods are proposed by the authors, with OPT shown to be optimal (more on this below). PL subsampling is shown to scale very well and to be a good approximation of optimal subsampling conditional on the leverage scores \(h_{ii}\) being fairly homogeneous.
With respect to the second question Zhu et al. (2015) investigate both ordinary least-squares (OLS) and weighted least-squares (WLS), where the weights simply correspond to the subsampling probabilities \(\{\pi_i\}^n_{i=1}\). The authors present empirical evidence that OLS is more efficient than WLS in the sense that the mean-squared error (MSE) for predicting \(\mathbf{X} \beta\) is lower for OLS. The authors also note, though, that subsampling with OLS is not consistent for non-uniform subsampling methods, meaning that the bias cannot be controlled. Given Equation (1.3), the fact that OLS is nonetheless more efficient than WLS implies that the higher variance associated with WLS dominates the relatively higher bias of OLS. In fact this is consistent with the theoretical results presented in Zhu et al. (2015) (more on this below).
Next we will briefly run through different estimation and subsampling methods in some more detail and see how they can be implemented in R. In the following section we will then look at how the different approaches perform empirically.
Both OLS and WLS are implemented here using QR decomposition. For OLS this is very easily done in R: given some feature matrix X and a corresponding outcome variable y we can use qr.solve(X, y) to compute \(\hat\beta\). For WLS we first need to weight observations by their corresponding subsampling probabilities. Following Zhu et al. (2015) we can construct a weighting matrix \(\Phi= \text{diag}\{\pi_i\}^m_{i=1}\) and compute the weighted least-squares estimator as (see appendix for a derivation):
\[ \begin{equation} \begin{aligned} && \hat\beta_m^{WLS}&= \left( \mathbf{X}^T \Phi^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^T\Phi^{-1}\mathbf{y}\\ \end{aligned} \tag{2.1} \end{equation} \]
In R weighted least-squares can be implemented as follows
wls_qr <- function(X, y, weights) {
# weights are expected to be the inverse subsampling probabilities 1/pi_i,
# so that diag(weights) corresponds to Phi^(-1) in Equation (2.1)
Phi <- diag(weights)
beta <- qr.solve(t(X) %*% Phi %*% X, t(X) %*% Phi %*% y)
return(beta)
}
where in order to implement the algorithm proposed in Zhu et al. (2015) the weights we need to supply as function arguments are \(w=\{1/\pi_i\}^m_{i=1}\). This follows from the fact that the inverse of a diagonal matrix is obtained by inverting its diagonal elements:
\[ \begin{aligned} && \Phi^{-1}&= \begin{pmatrix} 1\over\phi_{11} & 0 & 0 \\ 0 & ... & 0 \\ 0 & 0 & 1\over \phi_{nn} \end{pmatrix} \\ \end{aligned} \]
A simple function for uniform subsampling (UNIF) in R is shown in the code chunk below. Note that to streamline the comparison of the different methods in the following section the function takes an argument weighted=F, which for the other subsampling methods determines whether OLS or WLS should be used. Of course, with uniform subsampling the weights are all identical and hence \(\hat\beta^{OLS}=\hat\beta^{WLS}\), so the argument is passed to but not evaluated in UNIF.
UNIF <- function(X, y, m, weighted=F, rand_state=NULL) {
if (!is.null(rand_state)) {
set.seed(rand_state)
}
n <- nrow(X) # do not rely on a global n
indices <- sample(1:n, size=m)
X_m <- X[indices,]
y_m <- y[indices]
beta_hat <- qr.solve(X_m, y_m)
y_hat <- c(X %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat
)
)
}
The UNIF function can be extended easily to the case with basic leveraging (see code below). Note that in this case the weighted argument is evaluated.
BLEV <- function(X, y, m, weighted=F, rand_state=NULL, plot_wgts=F, prob_only=F) {
n <- nrow(X) # do not rely on a global n
svd_X <- svd(X)
U <- svd_X$u
H <- tcrossprod(U)
h <- diag(H)
prob <- h/ncol(X)
# Plot:
if (plot_wgts) {
plot(prob, t="l", ylab="Sampling probability")
}
# Output:
if (prob_only) {
return(prob)
} else {
indices <- sample(
x = 1:n,
size = m,
replace = T,
prob = prob
)
X_m <- X[indices,]
y_m <- y[indices]
weights <- 1/prob[indices]
if (weighted) {
beta_hat <- wls_qr(X_m, y_m, weights)
} else {
beta_hat <- qr.solve(X_m, y_m)
}
y_hat <- c(X %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat,
prob = prob
)
)
}
}
Recall that for the hat matrix we have
\[ \begin{equation} \begin{aligned} && \mathbf{H}&=\Phi (\Phi^T \Phi)^{-1}\Phi^T \\ \end{aligned} \tag{2.2} \end{equation} \]
where the diagonal elements \(h_{ii}\) correspond to the leverage scores we’re after. Following Zhu et al. (2015) we will use (compact) singular value decomposition to obtain \(\mathbf{H}\) rather than computing (2.2) directly. This has the benefit that there exist exceptionally stable numerical algorithms to compute SVD. To see how and why we can use SVD to obtain \(\mathbf{H}\) see appendix.
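As a quick, self-contained sanity check (not part of the original analysis; the matrix A and the objects below are purely illustrative), the leverage scores obtained from the compact SVD coincide with the diagonal of the hat matrix computed directly via (2.2):
# Illustrative check: leverage scores via SVD equal diag of hat matrix (2.2)
set.seed(42)
A <- matrix(rnorm(50 * 3), 50, 3)
h_svd <- rowSums(svd(A)$u^2) # diag(UU') without forming the n x n matrix
h_hat <- diag(A %*% solve(crossprod(A)) %*% t(A))
all.equal(h_svd, h_hat)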
Clearly to get \(h_{ii}\) we first need to compute \(\mathbf{H}\) which in terms of computational costs is of order \(\mathcal{O}(np^2)=\max(\mathcal{O}(np^2),\mathcal{O}(p^3))\). The fact that we use all \(n\) rows of \(\Phi\) to compute leverage scores even though we explicitly stated our goal was to only use \(m\) observations may rightly seem like a bit of a paradox. This is why fast algorithms that approximate leverage scores have been proposed. We will not look at them specifically here mainly because the PL method proposed by Zhu et al. (2015) does not depend on leverage scores and promises to be computationally even more efficient.
The basic characteristic of PL subsampling - choosing \(\{\pi_i\}^n_{i=1}= ||\mathbf{X}_i||/ \sum_{j=1}^{n}||\mathbf{X}_j||\) - was already introduced above. Again it is very easy to modify the subsampling functions from above to this case:
PL <- function(X, y, m, weighted=F, rand_state=NULL, plot_wgts=F, prob_only=F) {
n <- nrow(X) # do not rely on a global n
predictor_len <- sqrt(X**2 %*% rep(1,ncol(X)))
prob <- predictor_len/sum(predictor_len)
# Plot:
if (plot_wgts) {
plot(prob, t="l", ylab="Sampling probability")
}
# Output:
if (prob_only) {
return(prob)
} else {
indices <- sample(
x = 1:n,
size = m,
replace = T,
prob = prob
)
X_m <- X[indices,]
y_m <- y[indices]
weights <- 1/prob[indices]
if (weighted) {
beta_hat <- wls_qr(X_m, y_m, weights)
} else {
beta_hat <- qr.solve(X_m, y_m)
}
y_hat <- c(X %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat,
prob = prob
)
)
}
}
In fact, PL subsampling is an approximate version of optimal subsampling (OPT). Zhu et al. (2015) show that asymptotically we have:
\[ \begin{equation} \begin{aligned} &&\text{plim} \left( \text{var} (\hat{f}_n(x)) \right) > \text{plim} \left(\left( \mathbb{E} \left( \hat{f}_n(x) \right) - f(x) \right)^2 \right) \\ \end{aligned} \tag{2.3} \end{equation} \]
Given this result minimizing the MSE (Equation (1.3)) with respect to subsampling probabilities \(\{\pi_i\}^n_{i=1}\) corresponds to minimizing \(\text{var} (\hat{f}_n(x))\). They further show that this minimization problem has the following closed-form solution:
\[ \begin{equation} \begin{aligned} && \pi_i&= \frac{\sqrt{(1-h_{ii})}||\mathbf{X}_i||}{\sum_{j=1}^n\sqrt{(1-h_{jj})}||\mathbf{X}_j||}\\ \end{aligned} \tag{2.4} \end{equation} \]
This still has computational costs of order \(\mathcal{O}(np^2)\). But it should now be clear why PL subsampling is optimal conditional on leverage scores being homogeneous (see appendix). PL subsampling is associated with computational costs of order \(\mathcal{O}(np)\), so a potentially massive improvement. The code for optimal subsampling is shown below:
OPT <- function(X, y, m, weighted=F, rand_state=NULL, plot_wgts=F, prob_only=F) {
n <- nrow(X)
# Leverage scores:
svd_X <- svd(X)
U <- svd_X$u
H <- tcrossprod(U)
h <- diag(H)
# Euclidian norms:
predictor_len <- sqrt(X**2 %*% rep(1,ncol(X)))
# Optimal sampling probabilities:
prob <- (sqrt(1-h) * predictor_len) / crossprod(sqrt(1-h),predictor_len)[1]
# Plot:
if (plot_wgts) {
plot(prob, t="l", ylab="Sampling probability")
}
# Output:
if (prob_only) {
return(prob)
} else {
indices <- sample(
x = 1:n,
size = m,
replace = T,
prob = prob
)
X_m <- X[indices,]
y_m <- y[indices]
weights <- 1/prob[indices]
if (weighted) {
beta_hat <- wls_qr(X_m, y_m, weights)
} else {
beta_hat <- qr.solve(X_m, y_m)
}
y_hat <- c(X %*% beta_hat)
return(
list(
fitted = y_hat,
coeff = beta_hat,
prob = prob
)
)
}
}
Computing the Euclidean norms \(||\mathbf{X}_i||\) in R can be done explicitly by looping over the rows of \(\mathbf{X}\) and computing the norm in each iteration. It turns out that this is computationally very expensive. A much more efficient way of computing the vector of predictor lengths is as follows
\[ \begin{equation} \begin{aligned} && \mathbf{pl}&=\sqrt{\mathbf{X}^2 \mathbf{1}} \\ \end{aligned} \tag{2.5} \end{equation} \]
where \(\mathbf{X}^2\) indicates that elements are squared, the square root is also taken element-wise and \(\mathbf{1}\) is a \((p \times 1)\) vector of ones. A performance benchmark of the two approaches is shown in Figure 2.1 below.
n <- 200
p <- 100
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
check_for_equality <- function(values) {
tol <- 1e-12
error <- max(abs(values[[1]] - values[[2]]))
error < tol
}
mb <- microbenchmark(
"Loop" = {sapply(1:nrow(X), function(i) norm(as.matrix(X[i,]), type="f"))},
"Matrix product" = {sqrt(X**2 %*% rep(1,ncol(X)))},
check = check_for_equality
)
autoplot(mb)
Figure 2.1: Benchmark of Euclidean norm computations.
n <- 1000
p <- 100
As discussed in Zhu et al. (2015) both OPT and PL subsampling tend to inflate subsampling probabilities of observations with low leverage scores and shrink those of high-leverage observations relative to BLEV. They show explicitly that this always holds for orthogonal design matrices. As a quick sense-check of the functions introduced above we can generate a random orthogonal design matrix \(\mathbf{X}\) and plot subsampling probabilities with OPT and PL against those obtained with BLEV. Figure 2.2 illustrates this relationship nicely. The design matrix \(\mathbf{X}\) \((n \times p)\) with \(n=1000\) and \(p=100\) was generated using SVD:
rorthogonal <- function(n,p) {
M <- matrix(rnorm(n*p), n, p)
X <- svd(M)$u
return(X)
}
X <- rorthogonal(n,p)
# Subsampling methods:
methods <- list(
"UNIF" = UNIF,
"BLEV" = BLEV,
"OPT" = OPT,
"PL" = PL
)
smpl_probs <- rbindlist(
lapply(
names(methods)[2:4],
function(i) {
method <- i
prob <- methods[[method]](X, y=NULL, m=NULL, prob_only=T)
return(data.table(prob=prob, method=method))
}
)
)
dt_plot <- rbindlist(
lapply(
names(methods)[3:4],
function(i) {
other_method <- i
x <- smpl_probs[method=="BLEV"]$prob
y <- smpl_probs[method==other_method]$prob
data.table(x=x, y=y, y_var=other_method)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_grid(
cols=vars(y_var)
) +
labs(
x="BLEV",
y=""
)
Figure 2.2: Comparison of subsampling probabilities.
To illustrate the improvements associated with the methods proposed in Zhu et al. (2015), we will briefly replicate their main empirical findings here. To evaluate the performance of the different methods we will proceed as follows:
Empirical exercise
This exercise - and the ones that follow - are computationally expensive. For them to be feasible we have to refrain from bias-correcting the in-sample estimator of the MSE (see appendix for session info). We will use a simple wrapper function to implement the empirical exercise in R.
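The wrapper sim_subsampling itself is sourced and not displayed. A hedged sketch of what it presumably does - repeatedly subsample, refit, and decompose the MSE of the fitted values as in (1.3) - is given below. The benchmark used for the bias term, the exact labels and the function body are assumptions on my part; only the column names value and variables are taken from the plotting code that follows.
# Hedged sketch of the sim_subsampling wrapper (assumed implementation):
sim_subsampling_sketch <- function(X, y, m, estimator, weighted = FALSE, n_draws = 100) {
  fitted <- sapply(
    1:n_draws,
    function(k) estimator(X, y, m, weighted = weighted)$fitted
  )
  y_full <- c(X %*% qr.solve(X, y)) # full-sample fit as (in-sample) benchmark
  variance <- mean(apply(fitted, 1, var))
  sq_bias <- mean((rowMeans(fitted) - y_full)^2)
  data.table(
    variables = c("MSE", "squared bias", "variance"),
    value = c(variance + sq_bias, sq_bias, variance)
  )
}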
As in Zhu et al. (2015) we will generate the design matrix \(\mathbf{X}\) from 5 different distributions: 1) Gaussian (GA) with \(\mathcal{N}(\mathbf{0},\Sigma)\); 2) Mixed-Gaussian (MG) with \(0.5\mathcal{N}(\mathbf{0},\Sigma)+0.5\mathcal{N}(\mathbf{0},25\Sigma)\); 3) Log-Gaussian (LN) with \(\log\mathcal{N}(\mathbf{0},\Sigma)\); 4) the T-distribution with 1 degree of freedom (T1) and \(\Sigma\); 5) the T-distribution as in 4) but truncated at \([-p,p]\) (TT). All parameters are chosen in the same way as in Zhu et al. (2015) with the exception of \(n=1000\) and \(p=3\), which are significantly smaller choices made in order to reduce the computational costs. The corresponding densities of the 5 data sets are shown in Figure 5.1 in the appendix.
We will run the empirical exercise for each data set and each subsampling method introduced above. Figure 3.1 shows logarithms of the sampling probabilities corresponding to the different subsampling methods (UNIF not shown for obvious reasons). The plots look very similar to the ones in Zhu et al. (2015) and are shown here primarily to reassure ourselves that we have implemented their ideas correctly. One interesting observation is worth pointing out, however: note how the distributions for OPT and PL have lower standard deviations than BLEV. This should not be altogether surprising since we already saw above that for orthogonal design matrices the former methods inflate small leverage scores while shrinking high scores. But it is interesting to see that the same appears to hold for design matrices that are explicitly not orthogonal given our choice of \(\Sigma\).
set.seed(1)
library(expm)
matrix_grid <- expand.grid(i=1:p,j=1:p)
Sigma <- matrix(rep(0,p^2),p,p)
for (x in 1:nrow(matrix_grid)) {
i <- matrix_grid$i[x]
j <- matrix_grid$j[x]
Sigma[i,j] <- 2 * (0.8)^(abs(i-j))
}
# 1.) Design matrix (as in Zhu et al): ----
GA <- matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Gaussian mixture:
gaus_mix <- list(
gaus_1 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm(t(Sigma)),
gaus_2 = matrix(rnorm(n*p), nrow = n, ncol = p) %*% sqrtm((25 * t(Sigma)))
)
MG <- matrix(rep(0,n*p),n,p)
for (i in 1:nrow(MG)) {
x <- sample(1:2,1)
MG[i,] <- gaus_mix[[x]][i,]
}
# Log-Gaussian:
LN <- exp(GA)
# T-distribution:
T1 <- matrix(rt(n*p,1), nrow = n, ncol = p) %*% sqrtm(t(Sigma))
# Truncated T:
TT <- T1
TT[TT>p] <- p
TT[TT<(-p)] <- -p
data_sets <- list(
GA = list(X = GA),
MG = list(X = MG),
LN = list(X = LN),
TT = list(X = TT),
T1 = list(X = T1)
)
# 2.) Outcome:
data_sets <- lapply(
data_sets,
function(i) {
X <- i[["X"]]
beta <- c(rep(1,ceiling(0.6*p)),rep(0.1,floor(0.4*p)))
eps <- rnorm(n=n,mean=0,sd=10)
y <- X %*% beta + eps
list(X=X, y=y)
}
)
saveRDS(data_sets, "data/synthetic_data.rds")
smpl_probs <- readRDS("outputs/smpl_probs.rds")
pgrid <- expand.grid(method = names(methods)[3:4], data=names(data_sets))
dt_plot <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
other_method <- as.character(pgrid$method[i])
data_set <- pgrid$data[i]
x <- smpl_probs[method=="BLEV" & data==data_set]$prob
y <- smpl_probs[method==other_method & data==data_set]$prob
data.table(x=x, y=y, y_var=other_method, data=data_set)
}
)
)
ggplot(dt_plot, aes(x=x, y=y)) +
geom_abline(intercept = 0, slope=1, linetype="dashed") +
geom_point(shape=1) +
facet_wrap(
y_var ~ data,
scales = "free",
ncol=length(data_sets)
) +
labs(
x="BLEV",
y=""
)
smpl_probs <- readRDS("outputs/smpl_probs.rds")
ggplot(
data = smpl_probs,
aes(x=data, y=log(prob))
) +
geom_boxplot() +
facet_grid(
col=vars(method)
)
Figure 3.1: Sampling probabilities for different subsampling methods.
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- data_sets[[data_set]]$y
output <- sim_subsampling(X, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_zhu.rds")
grid_search <- readRDS("outputs/grid_search_zhu.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="m/n",
y="Logarithm",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
Figures 3.2 and 3.3 show the resulting MSE, squared bias and variance for the different subsampling methods and data sets using weighted least-squares and ordinary least-squares, respectively. The subsampling size increases along the horizontal axis. The figures are interactive to allow readers to zoom in etc.
For the data sets that are also shown in Zhu et al. (2015) we find the same overall pattern: PL and OPT outperform other methods when using weighted least-squares, while BLEV outperforms other methods when using unweighted/ordinary least-squares.
For Gaussian data (GA) the differences between the methods are minimal since the data points are homogeneous. A similar picture emerges when running the method comparison for the sinusoidal data introduced above (see appendix). In fact, Zhu et al. (2015) recommend simply relying on uniform subsampling when the data is Gaussian. Another interesting observation is that for t-distributed data (T1) the non-uniform subsampling methods significantly outperform uniform subsampling. This is despite the fact that in the case of T1 data the conditions used to establish asymptotic consistency of the non-uniform subsampling methods in Zhu et al. (2015) are not fulfilled: in particular the fourth moment is not finite (in fact it is not defined).
library(plotly)
layout_ggplotly <- function(gg, x = -0.075, y = -0.025, x_legend=1.05, y_legend=0.95, mar= list(l=50,r=150,b=50)){
# The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
gg[['x']][['layout']][['annotations']][[11]][['x']] <- x_legend
gg[['x']][['layout']][['legend']][['y']] <- y_legend
gg[['x']][['layout']][['legend']][['x']] <- x_legend
gg %>% layout(margin = mar)
}
gg <- ggplotly(p_list$WLS)
layout_ggplotly(gg)
Figure 3.2: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted least-squares.
gg <- ggplotly(p_list$OLS)
layout_ggplotly(gg)
Figure 3.3: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary least-squares.
n <- 200
p <- 100
We have already seen above that theoretically speaking both BLEV and OPT subsampling are computationally more expensive than PL subsampling (with UNIF subsampling the least expensive). It should be obvious that in light of their computational costs \(\mathcal{O}(np^2)\) the former two methods do not scale well in higher-dimensional problems (higher \(p\)). Zhu et al. (2015) demonstrate this through empirical exercises to an extent that is beyond the scope of this note. Instead we will just quickly benchmark the different functions for non-uniform subsampling introduced above: BLEV, OPT, PL for \(n=200\), \(p=100\). We are only interested in how long it takes to compute subsampling probabilities, and since for UNIF all subsampling probabilities are simply \(\pi_i=1/n\) we neglect this here. Figure 3.4 benchmarks the three non-uniform subsampling methods. Evidently PL subsampling is computationally much less costly.
library(microbenchmark)
X <- matrix(rnorm(n*p),nrow=n)
y <- rnorm(n)
mb <- microbenchmark(
"BLEV" = {BLEV(X,y,prob_only = T)},
"OPT" = {OPT(X,y,prob_only = T)},
"PL" = {PL(X,y,prob_only = T)}
)
autoplot(mb)
Figure 3.4: Benchmark of computational performance of different methods.
source("R/undersampling.R")
source("R/sim_undersampling.R")
source("R/logit_irls.R")
In binary classification problems we are often faced with the issue of imbalanced training data - one of the two classes is under-represented relative to the other. This generally makes classifiers less sensitive to the minority class which often is the class we want to predict (Branco, Torgo, and Ribeiro (2015)). Suppose for example we wanted to predict the probability of death for patients who suffer from COVID-19. The case-fatality rate for the virus is significantly lower than 10% so any data we could obtain on patients would inevitably be imbalanced: the domain is skewed towards the class we are not interested in predicting.
A common and straight-forward way to deal with this issue is to randomly over- or under-sample the training data. Let \(y_i\in\{0,1\}\) for all \(i=1,...,n\). We are interested in modelling \(p_i=P(y_i=1|x_i)\) but our data is imbalanced: \(n_{y=0}>>n_{y=1}\) where \(n=n_{y=0}+n_{y=1}\). Random oversampling then replicates randomly drawn minority-class observations (\(y=1\)) until the two classes are balanced, while random undersampling randomly discards majority-class observations (\(y=0\)) until balance is reached.
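As a concrete illustration, a minimal sketch of random undersampling could look as follows; the function name is hypothetical and the implementations actually used below are sourced from R/undersampling.R.
# Minimal sketch of random undersampling (illustrative only):
random_undersample <- function(X, y) {
  idx_min <- which(y == 1) # keep all minority-class observations
  idx_maj <- sample(which(y == 0), size = length(idx_min)) # balance with majority draws
  idx <- c(idx_min, idx_maj)
  list(X = X[idx, , drop = FALSE], y = y[idx])
}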
In a way both these methods correspond to the uniform subsampling (UNIF) discussed above. Random oversampling may lead to overfitting. Conversely, random undersampling may come at the cost of eliminating observations with valuable information. With respect to the latter, we have already shown that more systematic subsampling approaches generally outperform uniform subsampling in linear regression problems. It would be interesting to see if we can apply similar ideas to classification with imbalanced data. How exactly we can go about doing this should be straight-forward to see: instead of discarding majority-class observations uniformly at random, we can undersample the majority class using non-uniform (optimal) subsampling probabilities.
To remain in the subsampling framework we will focus on undersampling here. It should be noted that many systematic approaches to undersampling already exist. In their extensive survey Branco, Torgo, and Ribeiro (2015) mention undersampling methods based on distance-criteria, condensed nearest neighbours as well as active learning methods. Optimal subsampling is not mentioned in the survey.
We cannot apply the methods used for subsampling with linear regression directly to classification problems, although we will see that certain ideas still apply. Wang, Zhu, and Ma (2018) explore optimal subsampling for large sample logistic regression - a paper that is very much related to Zhu et al. (2015). We will not cover the details here as much as we did for Zhu et al. (2015) and instead jump straight to one of the main results. Wang, Zhu, and Ma (2018) propose a two-step algorithm that in the first step estimates a logistic regression with uniform subsampling using \(m_0\) observations. In the second step the estimated coefficient \(\beta^{(UNIF)}_{m_0}\) from the first step is used to compute subsampling probabilities as follows:
\[ \begin{aligned} && \pi_i&= \frac{|y_i-p_i(\beta^{(UNIF)}_{m_0})| \, ||\mathbf{X}_i||}{\sum_{j=1}^n|y_j-p_j(\beta^{(UNIF)}_{m_0})| \, ||\mathbf{X}_j||}\\ \end{aligned} \]
The probabilities are then used to draw a second subsample of \(m\) observations. Finally, \(\hat\beta_{m_0+m}\) is estimated from the combined subsamples with weighted logistic regression, where weights once again correspond to inverse sampling probabilities. The authors refer to the method based on the probabilities above as mVc since it corresponds to optimal subsampling with variance targeting. They also propose a method which targets the MSE directly, but it comes at higher computational cost (\(\mathcal{O}(np^2)\) vs. \(\mathcal{O}(np)\) for mVc). The similarities to the linear regression case should be obvious. We will only consider mVc here and see if it can be applied to undersampling. The code that implements this can be found here - it uses object-oriented programming so could be interesting for some readers. Wang, Zhu, and Ma (2018) never run the model with ordinary as opposed to weighted logit, possibly because ordinary logit is not sensible in this context. But for the sake of completeness we'll once again consider both cases here.
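A hedged sketch of the two-step algorithm is given below. It is not the object-oriented implementation referred to above; the function name, the choice of sampling with replacement in both steps, and the use of glm for the fits are assumptions on my part.
# Hedged sketch of the two-step mVc algorithm from Wang, Zhu, and Ma (2018):
mVc_sketch <- function(X, y, m0, m) {
  n <- nrow(X)
  # Step 1: pilot estimate from a uniform subsample of size m0
  # (m0 should be large enough to contain some minority cases)
  idx0 <- sample(1:n, size = m0, replace = TRUE)
  pilot <- glm(y[idx0] ~ X[idx0, ], family = "binomial")
  eta <- c(cbind(1, X) %*% coef(pilot))
  p_hat <- exp(eta) / (1 + exp(eta))
  # Step 2: mVc probabilities (proportional to |y - p| * ||X_i||) and weighted refit
  norms <- c(sqrt(X^2 %*% rep(1, ncol(X))))
  prob <- abs(y - p_hat) * norms
  prob <- prob / sum(prob)
  idx1 <- sample(1:n, size = m, replace = TRUE, prob = prob)
  idx <- c(idx0, idx1)
  w <- c(rep(n, m0), 1 / prob[idx1]) # inverse sampling probabilities (pi = 1/n in step 1)
  # glm warns about non-integer weights; the estimates are still the weighted MLE
  glm(y[idx] ~ X[idx, ], family = "binomial", weights = w)
}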
n <- length(data_sets$GA$y)
prob <- 0.05
In this section we will use the same synthetic data sets as above for our design matrix \(\mathbf{X}\). But while above we sampled \(y_i |x_i\sim \mathcal{N}(\beta^T x_i,\sigma)\), here we will simulate a single draw from \(\mathbf{y} |\mathbf{X} \sim \text{Bernoulli}(p)\), where in order to create imbalance we let \(p<0.5\) vary. To model the probability of \(y=1\) we will use logistic regression:
\[ \begin{equation} \begin{aligned} && \mathbf{p}&= \frac{ \exp( \mathbf{X} \beta )}{1 + \exp(\mathbf{X} \beta)} \end{aligned} \tag{4.1} \end{equation} \]
Equation (4.1) is not estimated directly but rather derived from linear predictions
\[ \begin{equation} \begin{aligned} \log \left( \frac{\mathbf{p}}{1-\mathbf{p}} \right)&= \mathbf{X} \beta \\ \end{aligned} \tag{4.2} \end{equation} \]
where \(\beta\) can be estimated through iterative re-weighted least-squares (IRLS) which is a simple implementation of Newton’s method (see for example Wasserman (2013); a complete derivation can also be found in the appendix):
\[ \begin{equation} \begin{aligned} && \beta_{s+1}&= \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{W}z\\ \text{where}&& z&= \mathbf{X}\beta_{s} + \mathbf{W}^{-1} (\mathbf{y}-\mathbf{p}) \\ && \mathbf{W}&= \text{diag}\{p_i(\beta_{s})(1-p_i(\beta_{s}))\}_{i=1}^n \\ \end{aligned} \tag{4.3} \end{equation} \]
In R this can be implemented from scratch as below. For the empirical exercises we will rely on glm([formula], family="binomial") which scales much better to higher dimensional problems than my custom function and also implements weighted logit.
logit_irls <- function(X, y, beta_0=NULL, tau=1e-9, max_iter=10000) {
if(!all(X[,1]==1)) {
X <- cbind(1,X)
}
p <- ncol(X)
n <- nrow(X)
# Initialization: ----
if (is.null(beta_0)) {
beta_latest <- matrix(rep(0, p)) # naive first guess
} else {
beta_latest <- beta_0 # use supplied starting values
}
W <- diag(n)
can_still_improve <- T
iter <- 1
# Iterative reweighted least-squares (IRLS):
while(can_still_improve & iter < max_iter) {
y_hat <- X %*% beta_latest
p_y <- exp(y_hat)/(1+exp(y_hat))
df_latest <- crossprod(X,y-p_y) # gradient
diag(W) <- p_y*(1-p_y)
Z <- X %*% beta_latest + qr.solve(W) %*% (y-p_y)
beta_latest <- qr.solve(crossprod(X,W%*%X),crossprod(X,W%*%Z))
can_still_improve <- mean(abs(df_latest))>tau # convergence reached?
iter <- iter + 1
}
return(
list(
fitted = p_y,
coeff = beta_latest
)
)
}
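As a quick check that the custom IRLS routine agrees with glm, one can run the following; the simulated X_chk, y_chk and the true coefficients are illustrative and not part of the original analysis.
# Illustrative check of logit_irls against glm:
set.seed(1)
X_chk <- matrix(rnorm(500 * 2), 500, 2)
y_chk <- rbinom(500, size = 1, prob = plogis(X_chk %*% c(1, -2)))
cbind(
  irls = c(logit_irls(X_chk, y_chk)$coeff),
  glm = coef(glm(y_chk ~ X_chk, family = "binomial"))
)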
When thinking about repeating the empirical exercise from Section 3 above, we are faced with the question of how to compute the MSE and its bias-variance decomposition. One idea could be to compare the linear predictions in (4.2) in the same way as we did before, since after all these are fitted values from a linear model. The problem with that idea is that rebalancing the data through undersampling affects the intercept term \(\beta_0\) (potentially quite a bit if the data was originally very imbalanced). This is not surprising since \(\beta_0\) measures the log odds of \(y_i=1\) given that all predictors \(\mathbf{X}_i\) are zero. Hence comparing the linear predictors \(\mathbf{X}\beta_n\) and \(\mathbf{X}\beta_m\) is not a good idea in this setting (perhaps ever?). Fortunately though we can just work with the predicted probabilities in (4.1) directly and decompose the MSE in the same way as before (see Manning, Schütze, and Raghavan (2008), p. 310):
\[ \begin{equation} \begin{aligned} && \mathbb{E} \left( (\hat{\mathbf{p}}-\mathbf{p})^2 \right) &= \text{var} (\hat{\mathbf{p}}) + \left( \mathbb{E} \left( \hat{\mathbf{p}} \right) - \mathbf{p} \right)^2 \\ \end{aligned} \tag{4.4} \end{equation} \]
Hence the empirical exercise in this section is very similar to the one above and can be summarized as follows:
Empirical exercise
There is a decisive difference between this exercise and the one we ran for linear regression models above: here we are mechanically reducing the number of observations of the majority class and with non-uniform subsampling we are selective about what observations we keep. This is very different to simply subsampling without imposing a structure on the subsample. Intuitively it may well be that in this context non-uniform subsampling is prone to produce estimates of \(\hat\beta_m\) that are very different from \(\hat\beta_n\). This effect could be expected to appear in the squared bias term of the MSE.
The resulting mean-squared error decompositions are shown in Figures 4.1 and 4.2 for weighted and ordinary logit. A first interesting and reassuring observation is that both for weighted and ordinary logit the variance component is generally reduced more with mVc subsampling (though less clearly so for weighted logit). A second observation is that the squared bias term evidently cannot be controlled with mVc, which leads to the overall mean-squared errors generally being higher than with random undersampling. This could be evidence that selective subsampling in fact introduces bias as speculated above, but this is not clear and would have to be explored more thoroughly.
It could also very well be that the algorithm in Wang, Zhu, and Ma (2018) was not implemented correctly here and the results are therefore incorrect. Unfortunately I have not had time to attempt a complete replication of that second paper as well.
# Simulation parameters:
data_sets <- tryCatch(
data_sets,
error = function(e) {
readRDS("data/synthetic_data.rds")
}
)
n <- 1000
m <- 2^(3:7)*1e-3
weighted <- c(T,F)
methods <- list(
"UNIF" = UNIF,
"mVc" = mVc
)
pgrid <- data.table(
expand.grid(
m=m,
method=names(methods),
weighted=weighted,
data_set=names(data_sets)
)
)
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(pgrid),
function(i) {
m <- pgrid[i,m]
estimator_name <- pgrid[i,method]
estimator <- methods[[estimator_name]]
weighted <- pgrid[i, weighted]
data_set <- pgrid$data[i]
X <- data_sets[[data_set]]$X
y <- sample(c(1,0),size=n,rep=T,prob = c(m,1-m))
vars <- undersampler(X,y) # undersampler instance
output <- sim_undersampling(vars, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
output[,data_set:=data_set]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("Logit","Weighted Logit")
grid_search[,log_value:=log(value)]
saveRDS(grid_search, file="outputs/grid_search_logit.rds")
grid_search <- readRDS("outputs/grid_search_logit.rds")
p_list <- lapply(
1:2,
function(i) {
ggplot(
data=grid_search[weighted==levels(weighted)[i]],
aes(x=m, y=value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows = vars(variables),
cols = vars(data_set),
scales = "free_y"
) +
scale_color_discrete(
name="Subsampling:"
) +
labs(
x="Proportion of sample",
y="Value",
title=levels(grid_search$weighted)[i]
)
}
)
names(p_list) <- levels(grid_search$weighted)
gg <- ggplotly(p_list$`Weighted Logit`)
layout_ggplotly(gg, y=-0.05)
Figure 4.1: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with weighted logistic regression. MSE and its components are computed on the predicted probabilities of the logistic regression.
gg <- ggplotly(p_list$Logit)
layout_ggplotly(gg, y=-0.05)
Figure 4.2: MSE, squared bias and variance for different subsampling methods and data sets. Subsampling with ordinary logistic regression. MSE and its components are computed on the predicted probabilities of the logistic regression.
Until now we have merely looked at approximations of \(\hat\beta_n\) for logistic regression with imbalanced data. But ultimately the goal in classification problems is to accurately predict \(\mathbf{y}\) from test data. To investigate this point we will next turn to real data.
dt <- fread("data/latestdata.csv")
dt <- dt[,.(ID, age, sex, latitude, longitude, outcome)]
for(col in names(dt)) {
set(dt, i=which(dt[[col]]==""), j=col, value=NA)
}
dt <- na.omit(dt)
dt[,patient_id:=.I][,ID:=NULL]
# Preprocess outcome variable:
death_states <- unique(dt$outcome)[grepl("die",unique(dt$outcome), ignore.case = T) |
grepl("dea",unique(dt$outcome), ignore.case = T) |
grepl("decea",unique(dt$outcome), ignore.case = T)]
dt[,y:=ifelse(outcome %in% death_states,1,0)][,outcome:=NULL]
# Preprocess features:
dt[,age:=mean(as.numeric(unlist(strsplit(age,"-")))),by=patient_id]
dt[,age:=as.numeric(age)]
dt[,sex:=ifelse(sex=="female",1,0)]
saveRDS(dt, "data/dt_preprocessed.rds")
dt <- readRDS("data/dt_preprocessed.rds")
share_ones <- table(dt$y)[2]/nrow(dt)*100
A number of interesting data sets related to COVID-19 have been made publicly available for research. An interesting data set for our purposes that contains information about individual patients is being maintained by the Open COVID-19 Data Working Group. The methodology for building the data set was originally developed by Xu et al. (2020). Information about how the data is being maintained and updated can be found in Group (2020). The complete data set is vast, but for our purposes we are only interested in the subset of data that contains information about the patient's outcome state. The subset of the data we will use for our prediction exercise here is shown below. It contains the patient outcome variable \(y\in \{0=\text{survived}, 1=\text{deceased}\}\) which we are interested in predicting from a set of features: age, gender and geographical location. The model is kept intentionally simple, as our goal here continues to be the evaluation of subsampling methods rather than building an accurate prediction model.
library(DT)
datatable(
dt,
class = 'cell-border stripe',
filter = 'top',
options = list(
pageLength = 5,
autoWidth = TRUE,
scrollX = TRUE
)
)
The outcome variable is defined as \(y\in\{0=\text{survived},1=\text{deceased}\}\) and is highly imbalanced as expected: deaths make up about 4 percent of all outcomes. Note also that we are now looking at a significantly higher value of \(n\) than in the synthetic examples above. Computing leverage scores in this setting is extremely costly for reasons we have already elaborated on above, and hence the leverage-based methods - basic leveraging (BLEV) and optimal subsampling (OPT) - are omitted from the remainder of the analysis. To compare the two remaining approaches - uniform (UNIF) and mVc subsampling - we will proceed as follows:
methods <- list(
"UNIF" = UNIF,
"mVc" = mVc
)
weighted <- c(T,F)
cv_splits <- function(N, n_folds) {
index <- 1:N
shuffled <- sample(index,N,F)
folds <- split(
as.numeric(shuffled),
f = rep_len(1:n_folds, N )
)
return(folds)
}
cv_fold <- 5
B <- 100
X_cols <- colnames(dt)[!colnames(dt)%in%c("y","patient_id")]
X <- as.matrix(dt[,.SD,.SDcols=X_cols])
y <- dt$y
pgrid <- data.table(expand.grid(method=names(methods), weighted=weighted))
To score our probabilistic predictions we will once again use the mean-squared error, although technically in this setting we would speak of the Brier score (see here):
\[ \begin{equation} \begin{aligned} && BS&= \frac{1}{n} \sum_{i=1}^{n} (y_i-\hat{p}_i)^2 \\ \end{aligned} \tag{4.5} \end{equation} \]
Of course other scoring functions are commonly considered for classification problems (for a discussion see for example here). But using the Brier score naturally continues the discussion above.
Empirical exercise
Cross-validate with \(k=5\) folds and for each fold repeat \(B=100\) times: undersample the majority class of the training folds (randomly or with mVc), fit (weighted) logistic regression on the rebalanced training data, predict probabilities for the held-out fold and compute the Brier score.
# Construct CV folds over the observations (this call is assumed; the chunk defining `folds` is not shown):
folds <- cv_splits(length(y), cv_fold)
grid_search <- rbindlist(
lapply(
1:length(folds),
function(k) {
# Perform train-test split
idx_test <- as.numeric(folds[[k]])
idx_train <- as.numeric(unlist(folds[-k]))
X_train <- X[idx_train,]
y_train <- y[idx_train]
X_test <- X[idx_test,]
y_test <- y[idx_test]
vars <- undersampler(X_train,y_train)
output <- rbindlist(
lapply( # Loop over grid values:
1:nrow(pgrid),
function(i) {
method_name <- pgrid[i,method]
weighted <- pgrid[i,weighted]
output <- rbindlist(
lapply( # Repeat B times:
1:B,
function(b) {
Z <- methods[[method_name]](vars, fit_model=F)
X <- Z$X
y <- Z$y
weights <- Z$weights
# Train model:
if (weighted & method_name!="UNIF") {
fitted_model <- glm(y ~ X, family = "binomial",
weights = weights)
} else {
fitted_model <- glm(y ~ X, family = "binomial")
}
y_hat <- cbind(1,X_test) %*% fitted_model$coefficients
y_hat <- exp(y_hat)/(1+exp(y_hat))
output <- data.table(
y = y_test,
y_hat = c(y_hat),
fold = k,
weighted = weighted,
b = b,
method = method_name,
n = 1:length(y_test)
)
return(output)
}
)
)
return(output)
}
)
)
return(output)
}
)
)
grid_search[,brier:=mean((y_hat-y)^2,na.rm=T),by=.(method, weighted, b, fold)]
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("Logit","Weighted Logit")
saveRDS(grid_search, file="outputs/grid_search_covid_adj.rds")
Figure 4.3 shows the distribution of Brier scores resulting from our pseudo-out-of-sample predictions. Undersampling is performed both randomly and with mVc subsampling. The logit classifier is fitted separately with and without subsampling probabilities as weights. In the case of ordinary logit mVc seems to lead to slightly better predictions on average, but the improvement is not very significant. When the model is fitted with weights something interesting happens. The majority of the estimated Brier scores are lower than those obtained with random undersampling, but there is much higher variance and in some cases mVc performs very poorly. This could imply that using optimal subsampling for undersampling with weighted regression is risky: it performs well most of the time, but sometimes things go very wrong. Of course, this is just a single example and the findings shown here are not sufficient to draw any conclusions.
grid_search <- readRDS("outputs/grid_search_covid_adj.rds")
dt_plot <- unique(grid_search[,.(method, brier, weighted)])
ggplot(
data=dt_plot,
aes(x=method, y=brier)
) +
geom_boxplot() +
facet_wrap(
~ weighted,
scales="free"
) +
labs(
x="Method",
y="Brier score"
)
Figure 4.3: Distribution of Brier scores from pseudo-out-of-sample predictions for random and mVc undersampling.
In this short note we have looked at various systematic subsampling methods and how they can be applied to imbalanced learning. The fact that non-uniform subsampling can improve the performance of linear regression models has been well established (Zhu et al. (2015)). We reviewed these findings here and then applied the developed ideas to a binary classification problem with imbalanced data. The findings are mixed: it does appear that optimal subsampling methods can be successfully applied to undersampling in the sense that they can improve model accuracy. While these findings are potentially interesting they are not conclusive. To establish general results, theory would need to be developed. It is also far from clear whether the undersampling approaches presented here can compete with existing methodologies that involve distance-criteria, condensed nearest neighbours and active learning methods.
Weighted least-squares
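A brief sketch of the derivation behind Equation (2.1): the weighted least-squares estimator minimizes the weighted residual sum of squares, and setting the gradient with respect to \(\beta\) to zero gives the closed form.
\[ \begin{aligned} \hat\beta_m^{WLS}&= \arg\min_\beta \; (\mathbf{y}-\mathbf{X}\beta)^T \Phi^{-1} (\mathbf{y}-\mathbf{X}\beta) \\ 0&= -2\mathbf{X}^T\Phi^{-1}(\mathbf{y}-\mathbf{X}\hat\beta_m^{WLS}) \\ \hat\beta_m^{WLS}&= \left( \mathbf{X}^T \Phi^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^T\Phi^{-1}\mathbf{y} \\ \end{aligned} \]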
From SVD to leverage
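Sketch: writing the compact SVD as \(\Phi=\mathbf{U}\mathbf{D}\mathbf{V}^T\) with \(\mathbf{U}^T\mathbf{U}=\mathbf{V}^T\mathbf{V}=\mathbb{I}\), the hat matrix in (2.2) reduces to \(\mathbf{U}\mathbf{U}^T\), so the leverage scores are the squared row norms of \(\mathbf{U}\), which is what BLEV and OPT compute from svd(X)$u.
\[ \begin{aligned} \mathbf{H}&= \Phi(\Phi^T\Phi)^{-1}\Phi^T = \mathbf{U}\mathbf{D}\mathbf{V}^T \left( \mathbf{V}\mathbf{D}^2\mathbf{V}^T \right)^{-1} \mathbf{V}\mathbf{D}\mathbf{U}^T \\ &= \mathbf{U}\mathbf{D}\mathbf{V}^T\mathbf{V}\mathbf{D}^{-2}\mathbf{V}^T\mathbf{V}\mathbf{D}\mathbf{U}^T = \mathbf{U}\mathbf{U}^T, \qquad h_{ii}=\sum_{j} u_{ij}^2 \\ \end{aligned} \]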
From OPT to PL
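Sketch: if the leverage scores are roughly homogeneous, \(h_{ii}\approx \bar{h}\) for all \(i\), the factor \(\sqrt{1-h_{ii}}\) in Equation (2.4) cancels and the optimal probabilities reduce to the predictor-length probabilities.
\[ \begin{aligned} \pi_i&= \frac{\sqrt{(1-h_{ii})}||\mathbf{X}_i||}{\sum_{j=1}^n\sqrt{(1-h_{jj})}||\mathbf{X}_j||} \approx \frac{\sqrt{(1-\bar{h})}\,||\mathbf{X}_i||}{\sqrt{(1-\bar{h})}\sum_{j=1}^n||\mathbf{X}_j||} = \frac{||\mathbf{X}_i||}{\sum_{j=1}^n||\mathbf{X}_j||} \\ \end{aligned} \]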
data_density <- rbindlist(
lapply(
1:length(data_sets),
function(i) {
x <- c(data_sets[[i]]$X)
x_bar <- mean(c(data_sets[[i]]$X))
sigma <- sd(c(data_sets[[i]]$X))
x_std <- (x - x_bar) / sigma
data.table(value = x_std, data_set=names(data_sets)[i])
}
)
)
ggplot(data_density, aes(value)) +
geom_density() +
facet_wrap(
~ data_set,
scales = "free"
) +
labs(
x="x (standardized)",
y="Density"
)
Figure 5.1: Densities of synthetic design matrices.
# Function parameters:
n <- 1000
p <- 9
a <- 0
b <- 1
x <- seq(a,b,length.out = n)
y_true <- sinusoidal(x)
y <- sim_sinusoidal(n)$y
v <- 0.3
mu <- seq(a,b,length.out = p)
Phi <- cbind(
rep(1,n),
sapply(
1:length(mu),
function(p) {
mu_p <- mu[p]
gauss_kernel(x=x, mu=mu_p, s = s)
}
)
)
# Simulation parameters:
m <- c()
counter <- 1
while(length(m)<7 & max(m/n)<.1) {
temp <- max(p+1,(2^(counter)*1e-3) * n)
m <- unique(c(m, temp))
counter <- counter + 1
}
weighted <- c(T,F)
param_grid <- data.table(expand.grid(m=m, method=names(methods), weighted=weighted))
set.seed(1)
grid_search <- rbindlist(
lapply(
1:nrow(param_grid),
function(i) {
m <- param_grid[i,m]
estimator_name <- as.character(param_grid[i,method])
estimator <- methods[[estimator_name]]
weighted <- param_grid[i, weighted]
output <- sim_subsampling(Phi, y, m, estimator, weighted=weighted)
output[,m:=m]
output[,method:=estimator_name]
output[,weighted:=weighted]
}
)
)
grid_search[,weighted:=factor(weighted)]
levels(grid_search$weighted) <- c("OLS","WLS")
grid_search[,log_value:=log(value)]
grid_search[,method := factor(method, levels=c("UNIF", "BLEV", "OPT", "PL"))]
saveRDS(grid_search, file="outputs/grid_search_sinus.rds")
grid_search <- readRDS("outputs/grid_search_sinus.rds")
ggplot(
data=grid_search,
aes(x=m/n, y=log_value, colour=method)
) +
geom_line() +
geom_point() +
facet_grid(
rows=vars(variables),
cols = vars(weighted)
) +
scale_color_discrete(
name="Subsampling method:"
) +
labs(
x="m/n",
y="Logarithm"
)
Iterative reweighted least-squares
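A brief sketch of how the update in Equation (4.3) arises: the log-likelihood of the logistic model has gradient \(\mathbf{X}^T(\mathbf{y}-\mathbf{p})\) and Hessian \(-\mathbf{X}^T\mathbf{W}\mathbf{X}\), so a Newton step can be rearranged into a weighted least-squares problem in the working response \(z\).
\[ \begin{aligned} \beta_{s+1}&= \beta_s + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p}) \\ &= (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W} \left( \mathbf{X}\beta_s + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) \right) = (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{W}z \\ \end{aligned} \]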
utils::sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.16 plotly_4.9.2.1 expm_0.999-5
## [4] Matrix_1.2-18 microbenchmark_1.4-7 reshape_0.8.8
## [7] data.table_1.13.0 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 highr_0.8
## [5] plyr_1.8.6 tools_4.0.2 digest_0.6.27 viridisLite_0.3.0
## [9] jsonlite_1.7.1 lattice_0.20-41 evaluate_0.14 lifecycle_0.2.0
## [13] tibble_3.0.3 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.8
## [17] crosstalk_1.1.0.1 yaml_2.2.1 xfun_0.18 httr_1.4.2
## [21] withr_2.3.0 stringr_1.4.0 dplyr_1.0.2 knitr_1.30
## [25] htmlwidgets_1.5.2 generics_0.0.2 vctrs_0.3.4 grid_4.0.2
## [29] tidyselect_1.1.0 glue_1.4.2 R6_2.4.1 rmarkdown_2.4
## [33] bookdown_0.20 tidyr_1.1.2 purrr_0.3.4 farver_2.0.3
## [37] magrittr_1.5 scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
## [41] colorspace_1.4-1 labeling_0.3 stringi_1.5.3 lazyeval_0.2.2
## [45] munsell_0.5.0 crayon_1.3.4
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. springer.
Branco, Paula, Luis Torgo, and Rita Ribeiro. 2015. “A Survey of Predictive Modelling Under Imbalanced Distributions.” arXiv Preprint arXiv:1505.01658.
Group, Open COVID-19 Data Working. 2020. “Detailed Epidemiological Data from the COVID-19 Outbreak.” Accessed on yyyy-mm-dd from http://virological.org/t/epidemiological-data-from-the-ncov-2019-outbreak-early-descriptions-from-publicly-available-data/337.
Manning, Christopher D, Hinrich Schütze, and Prabhakar Raghavan. 2008. Introduction to Information Retrieval. Cambridge university press.
Wang, HaiYing, Rong Zhu, and Ping Ma. 2018. “Optimal Subsampling for Large Sample Logistic Regression.” Journal of the American Statistical Association 113 (522): 829–44.
Wasserman, Larry. 2013. All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.
Xu, Bo, Bernardo Gutierrez, Sumiko Mekaru, Kara Sewalk, Lauren Goodwin, Alyssa Loskill, Emily Cohn, et al. 2020. “Epidemiological data from the COVID-19 outbreak, real-time case information.” Scientific Data 7 (106). https://doi.org/10.1038/s41597-020-0448-0.
Zhu, Rong, Ping Ma, Michael W Mahoney, and Bin Yu. 2015. “Optimal Subsampling Approaches for Large Sample Linear Regression.” arXiv, arXiv–1509.